Step 0:
A few words of caution:
1) Read all the way through the instructions.
2) Models must be built using Python.
3) No additional data may be added or used.
4) Not all data must be used to build an adequate model, but making use of complex variables will help us identify high-performance candidates.
5) The predictions returned should be the class probabilities for belonging to the positive class, not the class itself (i.e. a decimal value, not just 1 or 0). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset.
Step 1:
Clean and prepare your data: There are several entries where values have been deleted to simulate dirty data. Please clean the data with whatever method(s) you believe is best/most suitable. Note that some of the missing values are truly blank (unknown answers).
Step 2:
Build your models: Please build two distinctly different machine learning/statistical models to predict the value for y. When writing the code associated with each model, please have the first part produce and save the model, followed by a second part that loads and applies the model.
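The save-then-load pattern asked for above can be sketched with joblib, which scikit-learn recommends for model persistence. The model type and file name here are placeholder choices, not part of the exercise:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import joblib

# Part 1: fit and save the model (synthetic data stands in for the real train set)
X_demo, y_demo = make_classification(n_samples=200, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
joblib.dump(model, 'model1.joblib')

# Part 2: load and apply the model
loaded = joblib.load('model1.joblib')
probs = loaded.predict_proba(X_demo)[:, 1]  # probability of the positive class
print(probs[:5])
```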
Step 3:
Create predictions on the test dataset using both of your trained models. The predictions should be the class probabilities for belonging to the positive class (labeled "1"). Be sure to output a prediction for EACH of the 10,000 rows in the test dataset. Save the results of the two models in separate CSV files titled "results1.csv" and "results2.csv". Each result file should have a single column representing the output from one model.
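One minimal way to write the positive-class probabilities into the required single-column files. The random array here is only a stand-in for something like `model1.predict_proba(test)[:, 1]`:

```python
import numpy as np
import pandas as pd

# stand-in for the real model's positive-class probabilities
probs1 = np.random.rand(10000)

# one column, one row per test example, no index column
pd.DataFrame({'y_prob': probs1}).to_csv('results1.csv', index=False)
print(pd.read_csv('results1.csv').shape)
```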
Step 4:
Submit your work: In addition to the two result files (CSV format), please submit all of your code for cleaning, prepping, and modeling your data (text, html, or PDF preferred), and a brief write-up comparing the pros and cons of the two modeling techniques you used (PDF preferred).
Please do not submit the original data back to us. Your work will be scored on techniques used (appropriateness and complexity), model performance - measured by AUC - on the held-out data, your understanding of the two techniques you compared in your write-up, and your overall code.
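Since the hold-out is scored by AUC, it helps to estimate AUC locally on a validation split before submitting. A minimal sketch with scikit-learn; the synthetic data and model choice are illustrative, not the exercise data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo,
                                            test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
val_probs = clf.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_probs)  # AUC expects probabilities, not hard labels
print('validation AUC: {:.3f}'.format(auc))
```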
import numpy as np
import pandas as pd
# models
import sklearn
from sklearn import manifold
from sklearn import naive_bayes
from sklearn import svm
from sklearn import ensemble
# plots
import cufflinks as cf
import plotly
import plotly.io as pio
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
cf.set_config_file(offline=True, world_readable=True)
# to make this notebook's output identical at every run
np.random.seed(42)
import os
plt_filepath = 'plots/EDA/'
os.makedirs(plt_filepath, exist_ok=True)  # ensure the plot output directory exists
print('numpy version: {}'.format(np.__version__))
print('pandas version: {}'.format(pd.__version__))
print('sklearn version: {}'.format(sklearn.__version__))
print('cufflinks version: {}'.format(cf.__version__))
print('plotly version: {}'.format(plotly.__version__))
X = pd.read_csv('data/exercise_01_train.csv')
X.head()
"There are {} rows and {} columns in the train set".format(X.shape[0], X.shape[1])
y_counts = X.y.value_counts()
data = [go.Bar(
x=y_counts.index,
y=y_counts.values )]
title = 'Y Label: Value Counts'
layout = go.Layout( {'title': title } )
fig = go.Figure(data=data,
layout=layout
)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
X.dtypes.value_counts()
Let's investigate what the object columns are
objects = X.select_dtypes(include='object')
objects = objects.assign(y=X.y)
objects.head()
Columns x41 and x45 are actually numerical (dollar and percent strings) -- let's clean these and add them back to the other numerical columns
X.x41.value_counts()[:5]
X['x41'] = pd.to_numeric(X.x41.str.replace('$', '', regex=False))  # '$' is a regex anchor, so match it literally
X.x45.value_counts()
X['x45'] = pd.to_numeric(X.x45.str.replace('%', '', regex=False))
objects = objects.drop(['x41', 'x45'], axis=1)
objects.head()
For the remaining categorical columns, let's relabel these with more appropriate names
old_labels = ['x34', 'x35', 'x68', 'x93', 'y']
labels = ['car_manufacturer', 'day', 'month', 'market', 'y']
objects.columns = labels
objects.head()
Let's plot the categorical counts against the label to predict
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].car_manufacturer.value_counts()
y_label_1 = y_index.loc[1].car_manufacturer.value_counts()
# create trace1
trace1 = go.Bar(
x = y_label_0.index,
y = y_label_0.values,
name = "Label 0")
# create trace2
trace2 = go.Bar(
x = y_label_1.index,
y = y_label_1.values,
name = "Label 1")
data = [trace1, trace2]
title = "Car Manufacturer Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
Now let's explore the value distribution
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].day.value_counts()
y_label_1 = y_index.loc[1].day.value_counts()
# create trace1
trace1 = go.Bar(
x = y_label_0.index,
y = y_label_0.values,
name = "Label 0")
# create trace2
trace2 = go.Bar(
x = y_label_1.index,
y = y_label_1.values,
name = "Label 1")
data = [trace1, trace2]
title = "Day of the Week Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].month.value_counts()
y_label_1 = y_index.loc[1].month.value_counts()
# create trace1
trace1 = go.Bar(
x = y_label_0.index,
y = y_label_0.values,
name = "Label 0")
# create trace2
trace2 = go.Bar(
x = y_label_1.index,
y = y_label_1.values,
name = "Label 1")
data = [trace1, trace2]
title = "Month Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
y_index = objects.set_index('y')
y_label_0 = y_index.loc[0].market.value_counts()
y_label_1 = y_index.loc[1].market.value_counts()
# create trace1
trace1 = go.Bar(
x = y_label_0.index,
y = y_label_0.values,
name = "Label 0")
# create trace2
trace2 = go.Bar(
x = y_label_1.index,
y = y_label_1.values,
name = "Label 1")
data = [trace1, trace2]
title = "Market Counts by Label"
layout = go.Layout(barmode="group", title=title)
fig = go.Figure(data=data, layout = layout)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
X.dtypes.value_counts()
nums = X.select_dtypes(exclude='object')
nums.head()
nums.describe()
nums_subset_plus_y = nums.iloc[:500,:5]
nums_subset_plus_y = nums_subset_plus_y.assign(y=nums.y)
nums_subset_plus_y.dropna(inplace=True)
nums_subset_plus_y.head(1)
nums_subset_plus_y.drop('y', axis=1).iplot(title='Line Chart of First 5 Columns')
nums_subset_plus_y.iplot(kind='box', title='Box Plot of First 5 Columns')
nums.head()
nums.columns.tolist()[:10]
hist_columns = nums.columns.tolist()[:5]
#hist_columns.append('y')
hist_columns
nums[hist_columns].iplot(kind='hist')
nums_subset_plus_y.head(100).scatter_matrix()
# Create distplot with custom bin_size
fig = ff.create_distplot([nums_subset_plus_y[c] for c in nums_subset_plus_y.columns[:3]], nums_subset_plus_y.columns[:3].tolist(), bin_size=.25)
title = 'Example Distplot with First 3 Columns'
fig['layout'].update(title=title)
iplot(fig)
pio.write_image(fig, plt_filepath+title+'.png')
Pull a fresh copy of the data and re-clean it for modeling (clean both train and test the same way)
train_df = pd.read_csv('data/exercise_01_train.csv')
test_df = pd.read_csv('data/exercise_01_test.csv')
print("Train Missing")
(train_df.isnull().sum()/train_df.shape[0]).sort_values(ascending=False)[:5]  # fraction of missing values per column (top 5)
print("Test Missing")
(test_df.isnull().sum()/test_df.shape[0]).sort_values(ascending=False)[:5]  # fraction of missing values per column (top 5)
def handle_missing_data(df, drop=False, fill=False, impute=False):
    """Handle missing values: drop rows, fill with column means, or impute."""
    if drop:
        return df.dropna()
    if fill:
        return df.fillna(df.mean(numeric_only=True))
    if impute:
        # TODO: add a proper imputation strategy (e.g. sklearn's SimpleImputer)
        raise NotImplementedError('imputation is not implemented yet')
    return df
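The impute option is left unimplemented above; one possible version, sketched with scikit-learn's SimpleImputer (mean imputation, fit on train-like data so the same statistics could be reused on test). The toy frame here is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 4.0]})

# fit on the training frame; call .transform (not fit_transform) on test later
imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```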
def relabel_data(df):
    # strip currency/percent symbols and convert to numeric
    df['x41'] = pd.to_numeric(df.x41.str.replace('$', '', regex=False))
    df['x45'] = pd.to_numeric(df.x45.str.replace('%', '', regex=False))
    df['x34'] = df.x34.str.lower()
    # expand abbreviated day names; Series.replace maps whole cell values,
    # so already-full names like 'friday' are left untouched
    df['x35'] = df.x35.replace({'wed': 'wednesday', 'thur': 'thursday', 'fri': 'friday'})
    # expand abbreviated month names
    df['x68'] = df.x68.str.lower().replace({
        'jun': 'june', 'aug': 'august', 'sept.': 'september',
        'apr': 'april', 'oct': 'october', 'mar': 'march',
        'nov': 'november', 'feb': 'february',
        'dev': 'december',  # guessing dev = december
    })
    df['x93'] = df.x93.str.replace('euorpe', 'europe', regex=False)
    return df
def encode_categoricals(objects, prefix=['cars', 'day', 'month', 'market']):
    # one-hot encode, keeping an indicator column for NaNs
    objects = pd.get_dummies(objects, prefix=prefix, dummy_na=True)
    return objects
train_df = relabel_data(train_df)
objects = train_df.select_dtypes(include='object')
train_df = train_df.drop(objects.columns.tolist(), axis=1)
objects = encode_categoricals(objects)
train_df = pd.concat([train_df, objects], axis=1)
train_df = handle_missing_data(train_df, drop=True)
test_df = relabel_data(test_df)
objects = test_df.select_dtypes(include='object')
test_df = test_df.drop(objects.columns.tolist(), axis=1)
objects = encode_categoricals(objects)
test_df = pd.concat([test_df, objects], axis=1)
test_df = handle_missing_data(test_df, fill=True)
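One caveat with encoding train and test separately: pd.get_dummies derives its columns from the categories present in each frame, so the two frames can end up with different column sets. A defensive alignment step, shown here on toy frames:

```python
import pandas as pd

train = pd.get_dummies(pd.DataFrame({'day': ['mon', 'tue']}))
test = pd.get_dummies(pd.DataFrame({'day': ['mon', 'wed']}))

# reindex test onto the train columns; categories unseen in test become
# all-zero columns, and test-only categories are dropped
test = test.reindex(columns=train.columns, fill_value=0)
print(test.columns.tolist())
```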
print("Train Missing")
(train_df.isnull().sum()/train_df.shape[0]).sort_values(ascending=False)[:5]  # fraction of missing values per column (top 5)
train_df.shape
print("Test Missing")
(test_df.isnull().sum()/test_df.shape[0]).sort_values(ascending=False)[:5]  # fraction of missing values per column (top 5)
test_df.shape
train_df.dtypes.value_counts()
The int64 type refers to the y label
test_df.dtypes.value_counts()
y = train_df.pop('y')
train_df.to_csv('data/train.csv', index=False)
test_df.to_csv('data/test.csv', index=False)
y.to_csv('data/train_y.csv', index=False)
Correlation Does Not Equal Causation!
train_df_with_label = train_df.assign(y=y)
corr = train_df_with_label.corr()
corr[~corr.index.isin(['y'])].y.sort_values().iplot(kind='bar', title='Correlation w/Y Label')
corr.iplot(kind='heatmap', colorscale='spectral')
Save cleaned data